Update default naflexvit positional embedding interpolation mode to bilinear #2543
Conversation
Haha. Looks like I was a bit faster by being less helpful. #2542
I do not think that this is the correct place to make the change though. See the issue I opened.
@drhead you cannot change the default globally, only the config for SigLIP-specific models can be changed. I would like some convincing data for the change. I do not care about differences between the transformers and timm impl, only between the original jax models and timm... if torch bicubic is not convincingly different from jax bilinear then I will leave it as bicubic, as I've found it to be more robust overall... from my zero-shot eval comparisons there wasn't a convincing argument either way.
Sorry, our team has higher priorities than getting this set up with JAX. |
@redhottensors okay, well until someone has time to verify I'll stick with my original analysis that bicubic is the best choice. There are differences between torch and jax interpolation modes of the same 'type'. I evaluated both in zero-shot and bicubic appeared to 'win' across a few scenarios. It is expected that the timm and transformers impl would be very close numerically (aside from this difference), but what's more important in these decisions is what works best in comparison to the original in numerous downstream use cases. I will take another look when I get back to integrating with OpenCLIP.
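For anyone following along, here is a minimal sketch (not the timm code itself; the grid and embedding sizes are made up) of what the two modes under discussion do when a ViT positional embedding grid is resized with `torch.nn.functional.interpolate`:

```python
import torch
import torch.nn.functional as F

# Toy positional embedding grid, (1, embed_dim, H, W); values are random placeholders.
pos_embed = torch.randn(1, 768, 16, 16)

# Resize the same grid to a larger token grid with each of the two modes.
bilinear = F.interpolate(pos_embed, size=(24, 24), mode="bilinear", align_corners=False)
bicubic = F.interpolate(pos_embed, size=(24, 24), mode="bicubic", align_corners=False)

# The two modes give measurably different embeddings; the zero-shot comparisons
# above are about which of them tracks the original JAX model's behavior better.
print((bilinear - bicubic).abs().max())
```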
Closing, since I would agree this is a documentation bug, as #2542 was updated to reflect, at least as far as what I know on this issue goes:
@drhead yup, agreed with the above, but I've pointed this out to others before: 'what the model was trained on' is the JAX implementation of 'bilinear', not the torch impl of 'bilinear', and when considering the image interpolation preprocessing we actually have for the original:
And in torch we will have PIL or torchvision impl of ? + torch.nn.functional.interpolate of ? ... simply matching strings doesn't necessarily get you the best match or end result, given that all of the implementations differ. If the implementations of bilinear -> bilinear across frameworks are sufficiently different, then I usually use bicubic as it tends to behave as well (given the differences) or better across more size ranges and scenarios.
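To make the cross-framework point concrete, a rough sketch (shapes and values are arbitrary; this is not code from either repo) that feeds the same array through torch and JAX 'bilinear' resizing; even with matching mode strings the outputs are not guaranteed to match exactly:

```python
import numpy as np
import torch
import torch.nn.functional as F
import jax.image
import jax.numpy as jnp

# The same random grid for both frameworks: torch wants (1, C, H, W), JAX (H, W, C).
rng = np.random.default_rng(0)
grid = rng.standard_normal((16, 16, 8)).astype(np.float32)

torch_out = F.interpolate(
    torch.from_numpy(grid).permute(2, 0, 1).unsqueeze(0),   # (1, C, H, W)
    size=(24, 24), mode="bilinear", align_corners=False,
).squeeze(0).permute(1, 2, 0).numpy()                        # back to (H, W, C)

jax_out = np.asarray(jax.image.resize(jnp.asarray(grid), (24, 24, 8), method="bilinear"))

# "bilinear" vs "bilinear" across frameworks is a string match, not a numerical
# guarantee: sampling conventions, antialiasing defaults, and edge handling can
# all differ between implementations.
print(np.abs(torch_out - jax_out).max())
```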
The default positional embedding interpolation mode for the NaFlex SigLIP2 model doesn't match what is used for SigLIP2 in its official implementations. In fact, looking at the Transformers implementation, there's not even an option to have it as anything but bilinear: https://github.com/huggingface/transformers/blob/2781ad092dad77ff554cb70ec130b97e44cfba78/src/transformers/models/siglip2/modeling_siglip2.py#L174
Bilinear is probably the more appropriate default since it is what is used in the official implementations of SigLIP 2. Using the wrong mode causes significant deviation in cosine similarity between the outputs of the Transformers and timm implementations of SigLIP 2, whereas setting the mode to bilinear and using an input image that doesn't need resizing (to avoid preprocessing discrepancies) results in identical intermediate and final outputs between the two implementations.
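For reference, a sketch of the kind of comparison described above; loading of the two vision towers is omitted because the exact model identifiers would be assumptions, so the helper just takes the two pooled image embeddings:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def embedding_similarity(emb_hf: torch.Tensor, emb_timm: torch.Tensor) -> float:
    """Mean cosine similarity between two (batch, dim) image embedding tensors.

    emb_hf / emb_timm stand in for the pooled outputs of the Transformers and
    timm SigLIP 2 vision towers run on the same, natively sized input image.
    """
    return F.cosine_similarity(emb_hf, emb_timm, dim=-1).mean().item()

# With the interpolation mode set to bilinear and an input that needs no resizing,
# the report above is that this comes out at 1.0 (identical outputs); with bicubic
# it deviates noticeably.
```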